Abstract—This paper presents a hardware architecture of the
fast simplified successive cancellation (fast-SSC) algorithm for
polar codes, which significantly reduces the decoding latency and
dramatically increases the throughput. Algorithmically, the fast-SSC
algorithm suffers from the fact that its decoder scheduling and
the consequent architecture depend on the code rate; this is
a challenge for rate-compatible systems. However, by exploiting
the homogeneity between the decoding processes of fast
constituent polar codes and regular polar codes, the presented
design is compatible with any rate. The scheduling plan and the
purpose-built processing core are also described. Results show
that, compared with the state-of-the-art decoder, the proposed design
can achieve at least 60% latency reduction for codes of
length N = 1024. Using the Nangate FreePDK 45 nm process, the
proposed design can reach throughput of up to 5.81 Gbps and
2.01 Gbps for (1024, 870) and (1024, 512) polar codes, respectively.
Fig. 1. (a) Encoder of (8,4) polar code, (b) Tree representation of (8,4) SC
decoder


I. Introduction 

Recently, polar codes JT] have received significant attention
due to their capability to achieve the capacity of binary-input
memoryless symmetric channels with low-complexity encoding
and decoding schemes. Successive cancellation (SC) IB,
list successive cancellation (List-SC) H and belief propagation
(BP) 13 are the three most commonly proposed decoding
schemes. Among these, the SC decoder is the most promising for
practical hardware implementation because of its low O(N log N)
complexity, where N is the length of the code. Thus, many
relevant hardware designs have been proposed a El 0.

However, algorithmically, the SC decoder suffers from high
latency. Typically, for a conventional SC decoder, the latency
(2N - 2) increases linearly with the code length.
This is a significant challenge since polar codes work well
only at very long code lengths. Much work has been
done to reduce the latency of the SC decoder from both the hardware
and the algorithm side. In [7], a pre-computation method is
used to reduce the decoding latency from 2N - 2 to N - 1.
In in, three approaches (a dedicated 2-bit decoder for the
last stage of SC decoding, overlapped scheduling, and look-ahead
techniques) are applied, which eventually results in a
3N/4 - 1 latency. In ||3 and llTOI, by observing the tree
architecture of SC decoding, certain patterns of constituent codes
are found. These constituent codes can feed back the hard
decision information immediately without traversal, which can
significantly reduce the latency of decoding some polar codes
with a given architecture. This approach is referred to as the
fast-SSC decoder. Moreover, a processor-array based structure for
FPGA implementation is also proposed in Eol.

In this paper, a novel low-latency hardware architecture for
polar code decoding using the fast-SSC algorithm is presented.
Although the fast-SSC algorithm naturally lacks flexibility for
multiple rates, the proposed design overcomes this disadvantage
by utilizing the similarity between the decoding processes
of fast constituent polar codes and regular polar codes. The
corresponding scheduling plan is presented in this paper. We also
provide the design details of the processing unit (PU), which
is compatible with both regular polar codes and constituent
polar codes. Comparisons with other commonly discussed
SC decoders are given. For example, compared with the
2b-SC-Precomputation decoder, the fastest ASIC design of an
SC decoder to the best of our knowledge, the proposed design
can achieve at least 60% latency reduction for polar codes
of length N = 1024. The analysis of latency reduction
with respect to code rate is also presented. It shows that the
proposed architecture can yield a significant latency reduction,
especially at high code rates (code rate > 0.8). This is very
promising for modern communication or data storage systems
where high-rate codes are desired. Synthesis results using the
Nangate FreePDK 45 nm process show that the proposed
design can reach throughput of up to 5.81 Gbps and 2.01 Gbps
for (1024, 870) and (1024, 512) polar codes, respectively.

This paper is organized as follows. The relevant background
is reviewed in Section II. Then, the hardware implementation
of the proposed system is described in Section III. After that,
the synthesis results and relevant comparisons are discussed in
Section IV. Finally, the conclusion is given in Section V.


II. Background 


A. Polar Code and Tree analysis of SC Decoding 


As described in IB, a polar code is constructed by exploiting
channel polarization. Mathematically, polar codes are linear
block codes of length $N = 2^m$. The transmitted codeword
$x$ is computed by $x = uG$, where $G = F^{\otimes m}$, and
$F^{\otimes m}$ is the m-th Kronecker power of
$F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. Each row of $G$
corresponds to an equivalent
polarizing channel. For an (N, k) polar code, the k bits that carry
source information in $u$ are transmitted using the most reliable
channels. These are referred to as information bits. The remaining
N - k bits, called frozen bits, are set to zero and are placed
on the least reliable channels. Determining the locations of the
information and frozen bits depends on the channel model and
the channel quality, and is investigated in EB. Fig. 1(a) shows an
example of an (8,4) polar code encoder, where the black and
white nodes stand for the information bits and frozen bits,
respectively.
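
As a minimal sketch of the encoding step described above, the following Python builds the m-th Kronecker power of F and computes x = uG over GF(2). The information-bit positions {3, 5, 6, 7} used for the (8,4) example are an assumption made for illustration (a commonly used choice), not values read from Fig. 1(a), and no bit-reversal permutation is applied. The helper names are hypothetical.

```python
F = [[1, 0],
     [1, 1]]  # 2x2 polarization kernel

def kronecker_power(kernel, m):
    """Return the m-th Kronecker power of a 0/1 matrix given as nested lists."""
    g = [[1]]
    for _ in range(m):
        # Kronecker product g (x) kernel: one output row per (g_row, k_row) pair.
        g = [[a * b for a in g_row for b in k_row]
             for g_row in g for k_row in kernel]
    return g

def polar_encode(u, g):
    """Compute the codeword x = u * G with arithmetic over GF(2)."""
    n = len(u)
    return [sum(u[i] * g[i][j] for i in range(n)) % 2 for j in range(n)]

# Assumed information positions for an illustrative (8,4) code;
# every other position is frozen to zero.
INFO_POSITIONS = [3, 5, 6, 7]

def place_info_bits(info_bits, n=8, positions=INFO_POSITIONS):
    """Build the length-n input vector u from the k information bits."""
    u = [0] * n
    for pos, bit in zip(positions, info_bits):
        u[pos] = bit
    return u
```

For instance, `polar_encode(place_info_bits([1, 1, 1, 1]), kronecker_power(F, 3))` encodes four information bits into an 8-bit codeword.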


Polar codes can be decoded by recursively applying successive
cancellation to estimate $u_i$ using the channel output
and the previously estimated bits $\hat{u}_0^{i-1}$. This approach
is naturally represented by a binary tree in which each node
corresponds to a constituent code. The number of bits in one
constituent node at stage m (m = 0, 1, 2, ...), $N^m$, is equal to
$2^m$. Fig. 1(b) shows an example for an (8,4) polar code, where
$\alpha$ stands for the soft reliability value, typically a
log-likelihood ratio (LLR), and $\beta$ stands for the hard decision.
$\alpha_l$ and $\alpha_r$ are the messages passed from the parent node
to the left and right child, and
can be computed according to Eq. (1) and Eq. (2), respectively.

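$$\alpha_l[i] = f(\alpha_v[i], \alpha_v[i + N^m/2]) = \mathrm{sign}(\alpha_v[i]) \, \mathrm{sign}(\alpha_v[i + N^m/2]) \cdot \min(|\alpha_v[i]|, |\alpha_v[i + N^m/2]|) \tag{1}$$

$$\alpha_r[i] = g(\beta_l[i - N^m/2], \alpha_v[i], \alpha_v[i - N^m/2]) = (1 - 2\beta_l[i - N^m/2]) \, \alpha_v[i - N^m/2] + \alpha_v[i] \tag{2}$$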
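
The two update rules in Eq. (1) and Eq. (2) can be sketched in a few lines of Python. This is an illustrative software model, not the PU datapath: the function and argument names are hypothetical, and treating a zero LLR as positive is an assumption. The last helper mirrors the pre-computation idea of preparing both possible g results before the partial sum is known.

```python
def f(a, b):
    """Eq. (1): sign(a) * sign(b) * min(|a|, |b|); zero is treated as positive."""
    sign = -1 if (a < 0) != (b < 0) else 1
    return sign * min(abs(a), abs(b))

def g(beta_left, a_second, a_first):
    """Eq. (2): (1 - 2 * beta_left) * a_first + a_second.

    a_second plays the role of alpha_v[i] and a_first the role of
    alpha_v[i - N/2]; beta_left is the hard decision from the left child.
    """
    return (1 - 2 * beta_left) * a_first + a_second

def g_candidates(a_second, a_first):
    """Both possible g outputs, indexed by beta_left (0 -> sum, 1 -> difference)."""
    return a_second + a_first, a_second - a_first
```

Selecting `g_candidates(a_second, a_first)[beta_left]` gives the same value as calling `g` once the partial sum is available.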

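At stage 0, $\beta_v$ of a frozen node is always zero, and for an
information bit its value is calculated by threshold detection
of the soft reliability according to

$$\beta_v = h(\alpha_v) = \begin{cases} 0, & \text{when } \alpha_v \geq 0; \\ 1, & \text{otherwise.} \end{cases} \tag{3}$$

At intermediate stages, $\beta_v$ can be recursively calculated by

$$\beta_v[i] = \begin{cases} \beta_l[i] \oplus \beta_r[i], & \text{when } i < N^m/2; \\ \beta_r[i - N^m/2], & \text{otherwise.} \end{cases} \tag{4}$$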
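
Putting Eq. (1) to Eq. (4) together, a plain recursive SC decoder can be sketched as follows. This is a software illustration under stated assumptions rather than the hardware schedule: LLRs are positive for bit 0, the list `frozen` marks frozen leaves, sub-vectors are indexed from zero, and the small recursive encoder is included only so the example can be checked on a noiseless codeword. All names are hypothetical.

```python
def f(a, b):
    """Eq. (1): sign product times the smaller magnitude."""
    sign = -1 if (a < 0) != (b < 0) else 1
    return sign * min(abs(a), abs(b))

def g(beta_left, a_second, a_first):
    """Eq. (2): add or subtract depending on the left child's hard decision."""
    return (1 - 2 * beta_left) * a_first + a_second

def h(a):
    """Eq. (3): threshold detection."""
    return 0 if a >= 0 else 1

def sc_decode(alpha, frozen):
    """Return (beta, u_hat) for one node of the decoding tree."""
    n = len(alpha)
    if n == 1:  # stage 0: frozen bits are forced to zero
        bit = 0 if frozen[0] else h(alpha[0])
        return [bit], [bit]
    half = n // 2
    alpha_l = [f(alpha[i], alpha[i + half]) for i in range(half)]
    beta_l, u_l = sc_decode(alpha_l, frozen[:half])
    alpha_r = [g(beta_l[i], alpha[i + half], alpha[i]) for i in range(half)]
    beta_r, u_r = sc_decode(alpha_r, frozen[half:])
    # Eq. (4): combine the children's hard decisions.
    beta = [beta_l[i] ^ beta_r[i] for i in range(half)] + beta_r
    return beta, u_l + u_r

def polar_encode(u):
    """Recursive form of x = u * G without bit reversal (for checking only)."""
    n = len(u)
    if n == 1:
        return list(u)
    half = n // 2
    a, b = polar_encode(u[:half]), polar_encode(u[half:])
    return [a[i] ^ b[i] for i in range(half)] + b
```

On a noiseless input the decoder returns the transmitted `u`, and the root `beta` equals the codeword.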


B. Fast-SSC Algorithm 

The main idea of the fast-SSC algorithm is illustrated in [7], S
and Qol. By identifying constituent polar codes with certain
patterns, the hard decision $\beta_v$ of each constituent node can be
determined immediately, without traversing the entire subtree,
once the constituent polar code is activated. For a length
N constituent code in non-systematic polar codes, $u_N$ is
calculated by $u_N = \beta_v \cdot G_N$, where $G_N$ is the generator
matrix for a length N polar code. We adopt four kinds of
constituent polar codes in our design. These are $\mathcal{N}^0$,
$\mathcal{N}^1$, $\mathcal{N}^{SPC}$ and $\mathcal{N}^{REP}$, which are
called fast constituent polar codes.


$\mathcal{N}^0$ and $\mathcal{N}^1$ refer to those constituent codes which
contain only frozen bits or only information bits, respectively. For
$\mathcal{N}^0$ codes, we can set $\beta_v$ to 0 immediately. For an
$\mathcal{N}^1$ node, $\beta_v$ can be directly decided via threshold
detection, Eq. (3). $\mathcal{N}^{SPC}$ and $\mathcal{N}^{REP}$ are two
kinds of constituent codes containing both frozen
bits and information bits. In a length N $\mathcal{N}^{SPC}$ code, only
the first bit is frozen. This renders the constituent code a rate
(N - 1)/N single parity check (SPC) code. This code can be
decoded by performing a parity check with the least reliable bit, which
has the minimum absolute LLR value. First, get the hard
decision $HD_v$ of $\beta_v$ via threshold detection. Then, calculate
the parity by

$$\text{parity} = \bigoplus_{i=1}^{N^m} HD_v[i], \tag{5}$$

and find the index of the least reliable bit via

$$j = \arg\min_i |\alpha_v[i]|. \tag{6}$$

Eventually, $\beta_v$ is decided by

$$\beta_v[i] = \begin{cases} HD_v[i] \oplus \text{parity}, & \text{when } i = j; \\ HD_v[i], & \text{otherwise.} \end{cases} \tag{7}$$
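
The three SPC steps in Eq. (5) to Eq. (7) translate directly into a short routine. The sketch below is a behavioural model only; the function name is hypothetical, and resolving ties in Eq. (6) toward the lowest index is an assumption.

```python
def decode_spc(alpha):
    """Decode a single-parity-check node from its input LLRs."""
    # Hard decisions by threshold detection (0 for non-negative LLRs).
    hd = [0 if a >= 0 else 1 for a in alpha]
    # Eq. (5): parity of all hard decisions.
    parity = 0
    for bit in hd:
        parity ^= bit
    # Eq. (6): index of the least reliable input (first one on ties).
    j = min(range(len(alpha)), key=lambda i: abs(alpha[i]))
    # Eq. (7): flip the least reliable bit only when the parity check fails.
    beta = list(hd)
    beta[j] ^= parity
    return beta
```

When the hard decisions already satisfy the parity check, `parity` is 0 and the output equals the hard decisions unchanged.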


In a length N $\mathcal{N}^{REP}$ code, only the last bit is an information bit.
In this case, all the $\beta_v[i]$ should be the same and reflect
the information contained in the single information bit.
Thus, the decoding algorithm starts by summing all input LLRs,
and $\beta_v$ is calculated as

$$\beta_v[i] = \begin{cases} 0, & \text{when } \sum_j \alpha_v[j] \geq 0; \\ 1, & \text{otherwise.} \end{cases} \tag{8}$$
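
Eq. (8) amounts to one accumulation followed by one threshold decision, as the following minimal sketch shows. The function name is hypothetical; in hardware the sum would be formed by an adder tree rather than a sequential loop.

```python
def decode_rep(alpha):
    """Decode a repetition node: every output bit follows the sign of the LLR sum."""
    bit = 0 if sum(alpha) >= 0 else 1
    return [bit] * len(alpha)
```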


Fig. 2 gives examples of the tree representations of these four
kinds of constituent polar codes.


III. Hardware Implementation 

In this section, a novel hardware implementation of the fast-SSC
decoder is presented. For a polar code with a given length,
different code rates yield different distributions of constituent
polar codes. A well-designed architecture should have
the capability and flexibility to deal with different rates.
By exploiting the homogeneity between the decoding
processes of fast constituent polar codes and regular polar
codes, our design supports a variety of rates. The scheduling
scheme based on the proposed architecture is also discussed.
Additionally, we develop an approach for sharing and reusing
computational elements to achieve higher hardware efficiency.


A. System Overview 

As introduced in 0, the tree architecture and the line architecture
are the most common for SC decoders. The line architecture has
a higher hardware utilization but needs increased complexity
in the control module and memory access. Thus, we adopt the tree
architecture in our design. Fig. 3 shows an overview of the
proposed system when the code length is 16. The processing unit
(PU) performs the f and g functions in Eq. (1) and Eq. (2),
respectively, and its arithmetic part is used to decode
$\mathcal{N}^{SPC}$ and $\mathcal{N}^{REP}$ as well. The pre-computation
technique is also used, which allows the f and g functions to be
updated in the same clock cycle. The PU used in stage 0 differs
slightly from the ordinary PU; we denote it by $PU_0$ in the figure.
According to Eq. (6), the minimum LLR value needs to be
found. The comparator tree is used for this since it
inherently exists in the tree architecture of PUs. A judicious
scheduling permits obtaining the minimum value at stage 0
and recording the choice of the smaller input for each PU at
each stage. After that, a backward operation implemented by
a series of parity transmit units (PTUs) helps to locate
the minimum one in the length N $\mathcal{N}^{SPC}$ constituent
polar code. Design details are given in Section III-C.
The estimation of the current bit in SC decoding is based on the
information of previously decoded bits ($\beta$). This information is
also called the partial sum. Thus, a partial sum generator (PSG)
which can cooperate with the decoding pipeline is also needed.
We adopt the PSG introduced in Ifla in our design, and it is
compatible with our system. Thus, the design of the PSG is not
discussed in this paper.


B. Dataflow, latency and flexibility analysis 

In terms of the tree representation, the SC decoder conventionally
processes one node in each clock cycle. Traversal of a subtree
containing N leaf nodes needs 2N - 2 clock cycles. By using
pre-computation as introduced in 0, which calculates the f
function and all the possible results of the g function in the same
clock cycle, the latency can be reduced to N - 1. In our design,
if this subtree belongs to the fast constituent polar codes, the
latency can be further reduced.

Fig. 3. Overview of proposed system when code length = 16
Fig. 4. (a) Design details of PU, (b) Design details of $PU_0$

For $\mathcal{N}^0$, the $\beta_v$ are all set to 0, and for
$\mathcal{N}^1$, the $\beta_v$ are determined by hard decision of the
input LLRs. Both of these computations need only one clock cycle
after they are activated. For $\mathcal{N}^{SPC}$, according to
Eq. (5), Eq. (6), and Eq. (7), only three operations are needed.
Finding the minimum LLR can
be done by a comparator tree, which naturally exists in an SC
decoder with tree architecture since every PU has a comparator
for Eq. (1). For N LLRs, finding the smallest one takes $\log_2 N$
clock cycles. Meanwhile, we can obtain the parity bit when
the minimum LLR is found, which will be explained in the
next subsection. After that, one more clock cycle is needed for the
single parity check, which is done by an XOR gate. Thus,
in total, decoding a length N $\mathcal{N}^{SPC}$ constituent polar code
needs $\log_2 N + 1$ clock cycles. For $\mathcal{N}^{REP}$, according
to Eq. (8), an accumulation operation is needed. Similar to the
comparator tree, an adder tree also exists in an SC decoder with the
tree architecture since every PU has an adder for Eq. (2). A
length N $\mathcal{N}^{REP}$ constituent polar code needs $\log_2 N$
clock cycles to decode.

$\mathcal{N}^0$ and $\mathcal{N}^1$ have time complexity O(1), and
$\mathcal{N}^{SPC}$ and $\mathcal{N}^{REP}$ have time complexity
$O(\log_2 N)$. Compared with the commonly discussed SC architectures
in 0, 0 and 0, which all
have linear time complexity O(N), we can benefit significantly
from the proposed scheduling scheme in terms of latency, especially
for very large N. The latency reduction for N = 1024 polar
codes with different rates will be presented in the next section.
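
To make the per-node cycle counts concrete, the sketch below estimates the latency of a whole decoding tree from its frozen-bit pattern. It uses the counts stated above (1 cycle for $\mathcal{N}^0$ and $\mathcal{N}^1$, $\log_2 N$ for $\mathcal{N}^{REP}$, $\log_2 N + 1$ for $\mathcal{N}^{SPC}$). The rule for regular nodes (one cycle plus both children, with single leaves absorbed into their parent) is an assumption chosen so that a tree without fast nodes costs N - 1 cycles; it is a rough model, not the exact schedule of the proposed design, and the names are hypothetical.

```python
def node_kind(frozen):
    """Classify a constituent code by its frozen-bit pattern (True = frozen)."""
    if len(frozen) < 2:
        return "LEAF"
    if all(frozen):
        return "N0"
    if not any(frozen):
        return "N1"
    if all(frozen[:-1]):          # only the last bit carries information
        return "REP"
    if not any(frozen[1:]):       # only the first bit is frozen
        return "SPC"
    return "REGULAR"

def latency_cycles(frozen):
    """Estimated clock cycles to decode the subtree with this frozen pattern."""
    n = len(frozen)
    stages = n.bit_length() - 1   # log2(n) for power-of-two n
    kind = node_kind(frozen)
    if kind == "LEAF":
        return 0                  # assumed to be absorbed into the parent cycle
    if kind in ("N0", "N1"):
        return 1
    if kind == "REP":
        return stages
    if kind == "SPC":
        return stages + 1
    half = n // 2
    # Assumed rule for regular nodes with pre-computation.
    return 1 + latency_cycles(frozen[:half]) + latency_cycles(frozen[half:])
```

Under this model an 8-bit tree without any fast node takes 7 cycles, while a tree whose two halves are a REP node and an SPC node takes 6.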

The main challenge for a fast-SSC decoder is that the architecture
is subject to the code rate. This is
because polar codes with different rates do not have the same
distribution of constituent polar codes. The proposed design
overcomes this obstacle by exploiting the similarity between
the decoding architectures of fast constituent and regular polar
codes. The specifically designed PU allows the tree architecture
to deal with both fast constituent and regular polar codes,
which means the entire decoding process can run smoothly
no matter what the distribution of constituent codes is.
The architecture is independent of, and does not rely on, the
distribution of constituent codes. This property provides the
flexibility for multiple rates. To switch from one rate to another,
only the control signals for the given PUs need to be modified.
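
One way to picture rate switching as a pure control change is to derive a list of node operations from the frozen-bit pattern while the datapath stays fixed. The sketch below is a hypothetical illustration: the tuple format (offset, size, kind) and the function names are assumptions, and the paper does not specify how its control signals are encoded.

```python
def classify(frozen):
    """Return the fast node kind for a frozen pattern, or None for a regular node."""
    if all(frozen):
        return "N0"
    if not any(frozen):
        return "N1"
    if all(frozen[:-1]):
        return "REP"
    if not any(frozen[1:]):
        return "SPC"
    return None

def control_schedule(frozen, offset=0):
    """List the fast nodes visited, in decoding order, as (offset, size, kind)."""
    kind = classify(frozen)
    if kind is not None:
        return [(offset, len(frozen), kind)]
    half = len(frozen) // 2
    return (control_schedule(frozen[:half], offset)
            + control_schedule(frozen[half:], offset + half))
```

Two different frozen patterns of the same length produce two different schedules for the same tree of PUs, which is the sense in which only the control changes.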

C. Processing Unit Design 

Fig. 4(a) shows the design details of the PU. A single PU can
perform the f and g functions in Eq. (1) and Eq. (2), respectively.
Also, a PU tree can help to find the minimum value
or do accumulation for multiple inputs. In Fig. 4(a), S
stands for a sign-magnitude number and C stands for a
2's complement number. Unlike the PU design in 0, in
which data are initially stored in sign-magnitude form, our
design uses 2's complement as the initial form. We do this for
two reasons. 1) According to the synthesis result, the critical
path of the PU lies along the g function path. By moving the
number system conversion modules to the f function path, which
means using 2's complement as the initial data form, the critical
path still lies along the g function path, but is significantly
shorter. 2) Compared with the four number system conversion
modules used in 0, only three are needed when using 2's
complement numbers. This is more hardware efficient. The
benefits of this modification can be seen in Section IV.

For each PU, two LLRs are fed simultaneously. Since
we use the pre-computation technique, the f and g functions are
calculated at the same time, and which one is output
is determined by Mode select 2. According to Eq. (2), there
are only two possible results for the g function: the sum or the
difference. The final result depends on the corresponding partial
sum, so two registers are used here to hold the most recently
computed values until the corresponding partial sum is calculated.
When the PU calculates the sum for decoding $\mathcal{N}^{REP}$, only
additions are needed. The datapath is decided by the Mode select 1
signal. When the f function is performed, according to Eq. (1),
both inputs are divided into two parts: the sign bit and the unsigned
magnitude. Each part is processed separately first, and then the
results of the two parts are combined to obtain the updated
value. C to S and S to C modules are needed before and
after the comparison, respectively. When the PU deals with
$\mathcal{N}^{SPC}$, the result of the comparison should be recorded
in a register as the select signal for the PTU. Since the search for
the minimum value lasts several clock cycles, there should be a
feedback path for the register to hold this value in the later clock
cycles. The input source is chosen by the Mode select 3 signal.

Fig. 5. Effect of quantization on the BER/FER performance of (1024, 512)
code


Since every PU applies an exclusive-OR operation to the sign bits
of its two inputs, according to Eq. (1), the sign bit of the final
value in stage 0 equals the parity. Eq. (7) can
be performed using an XOR gate. The PU that contains the
minimum LLR receives the parity check bit and the others
receive 0s. The transmission of the parity check bit is done by the
PTU, which is a two-input, two-output module. One input is
the parity check bit (PCB) and the other is the select signal
(SS). The parity check bit is transmitted via output 1 (O1) or
output 2 (O2) based on the value of SS. Table I shows the
truth table of the PTU. We can obtain the logic expressions of O1
and O2 as O1 = PCB AND (NOT SS), O2 = PCB AND SS.
This can be done with two AND gates and one inverter.


TABLE I. Truth table of PTU

PCB   SS   O1   O2
 0     0    0    0
 1     0    1    0
 0     1    0    0
 1     1    0    1
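
The interplay between the comparator tree and the PTUs can be modelled in software as a forward pass that records one select bit per comparison and a backward pass that routes the parity check bit to a single leaf. This is a behavioural sketch under assumptions: SS = 0 is taken to mean that the first input of a comparison was the smaller (or equal) one, which matches O1 receiving the bit when SS = 0, and all function names are hypothetical.

```python
def ptu(pcb, ss):
    """Parity transmit unit: O1 = PCB AND (NOT SS), O2 = PCB AND SS."""
    return pcb & (ss ^ 1), pcb & ss

def comparator_tree_selects(magnitudes):
    """Forward pass: record, stage by stage, which input of each pair was smaller."""
    selects = []
    level = list(magnitudes)
    while len(level) > 1:
        pairs = [(level[2 * i], level[2 * i + 1]) for i in range(len(level) // 2)]
        selects.append([0 if a <= b else 1 for a, b in pairs])
        level = [min(a, b) for a, b in pairs]
    return selects

def route_parity(parity, selects):
    """Backward pass: push the parity check bit from the root to one leaf."""
    signals = [parity]
    for stage in reversed(selects):
        routed = []
        for pcb, ss in zip(signals, stage):
            routed.extend(ptu(pcb, ss))
        signals = routed
    return signals
```

With `parity = 1` the result is a one-hot vector marking the position of the minimum magnitude; with `parity = 0` every leaf receives 0, so no hard decision is flipped.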


The PU in stage 0, denoted $PU_0$ in Fig. 3, has a simpler
architecture. Fig. 4(b) shows the design details of $PU_0$. Since
only one more clock cycle is needed for the single parity check,
there is no feedback to this register. Furthermore,
$\mathcal{N}^{SPC}$ cannot exist in stage 0, so the top part of
Fig. 4(a), which is related to the single parity check, can be
removed. For the g function and $\mathcal{N}^{REP}$, the
output of the f function can be fed back immediately, and
the sign bit of the addition result is the partial sum for
$\mathcal{N}^{REP}$.

D. Fixed point analysis 

Fig. 5 shows the effect of quantization on the (1024, 512)
polar code. For channel outputs and inner LLRs, we use
separate quantization schemes. The quantization schemes are
given in (C, L, F) format, where C, L and F are the numbers
of bits used to represent the channel output, the inner LLRs and
the fractional parts of both channel output and LLRs, respectively.
Since no multiplication or division is used, the length
of the fraction does not change, so channel outputs and inner LLRs
use the same fractional precision. As a trade-off
between hardware efficiency and decoding performance, we
choose the (4, 5, 0) quantization scheme in our design.
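
A (C, L, F) scheme can be emulated in software with a fixed-point quantizer and a saturating adder, as sketched below. The details are assumptions for illustration: values are held as 2's complement integer codes, out-of-range values saturate, and Python's built-in `round` (which rounds halves to the nearest even integer) stands in for whatever rounding the hardware uses. The function names are hypothetical.

```python
def quantize(value, total_bits, frac_bits):
    """Map a real value to a saturated 2's complement code with frac_bits fraction bits."""
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    code = round(value * (1 << frac_bits))
    return max(lowest, min(highest, code))

def saturating_add(a, b, total_bits):
    """Add two integer codes and clip the result to the total_bits range."""
    lowest = -(1 << (total_bits - 1))
    highest = (1 << (total_bits - 1)) - 1
    return max(lowest, min(highest, a + b))

# (C, L, F) = (4, 5, 0): 4-bit channel outputs, 5-bit inner LLRs, no fraction bits.
C_BITS, L_BITS, F_BITS = 4, 5, 0
```

With these settings a channel value is first mapped by `quantize(value, C_BITS, F_BITS)`, and sums formed inside the decoder are clipped with `saturating_add(a, b, L_BITS)`.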

IV. Hardware Analysis and Comparison 

In this section, comparisons between the proposed design
and other state-of-the-art designs are given, and synthesis
results using the Nangate FreePDK 45 nm process are also
presented. Table II shows the hardware comparison of different
(n, k) SC decoders with q-bit quantization for inner LLRs
using tree architectures. All the throughputs and hardware
complexities (HC) are normalized to the SC decoder in [4], and
the hardware complexity is estimated based on the synthesis
results. The latency of the proposed design is given as a range
as the code rate changes from 0.05 to 0.95. From this table,
we can see that our proposed design achieves the highest
throughput per unit of hardware complexity. The exact latency
depends on the code rate. Fig. 6 shows the latency reduction of
the proposed design for code rates from 0.05 to 0.95.
The reduction is relative to the 2b-SC-Precomputation decoder,
which so far is known to be the fastest. The figure shows that at
least 60% latency reduction can be achieved by our proposed
design. This is very promising for many applications where
high-rate channel codes are needed, such as data storage
systems.

TABLE II. Hardware comparison of different (n, k) SC
decoders with q-bit quantization for inner LLRs using tree
architecture

Hardware Type            m        m         ——           Proposed Design
# of PU                  n - 1    n - 1     n - 1        n - 1
# of PTU                 0        0         0            2/n - 1
# of 1-bit REG           ≈ 3qn    ≈ qn      ≈ 3qn        ≈ (3q + 1)n
HC                       1.3      1         1.3          i.di
Latency (clock cycle)    n - 1    2n - 2    0.75n - 1    ≈ (0.1 ~ 0.3)n
Throughput               2        1         2.57         ≈ 5.59 ~ 22.25
Throughput/HC            1.54     1         TTT          5.1 ~ 15.99

Fig. 6. Latency Reduction vs. Code Rate

Additionally, we implemented the proposed design in
Verilog for the polar code of length 1024 and synthesized
it using the Nangate FreePDK 45 nm process with
Synopsys Design Compiler. We calculated the throughput
for the (1024, 870) and (1024, 512) polar codes. Table III shows
the synthesis results for the (1024, 870) and (1024, 512) polar
codes. Notice that the maximum frequency is higher than that
reported in fS], which uses the same process as our design. Our
design should in theory have a lower maximum frequency since
we have one more Mux delay for regular and fast constituent
polar codes. This performance improvement is attributable to the
modification we have made to the PU as described in Section III-C.

TABLE III. Synthesis result for (1024, 870) and (1024, 512)
polar codes

Silicon Area (μm²)                       275899
Max Frequency (GHz)                      071
Latency (1024, 870) (clock cycle)        156
Throughput (1024, 870) (Gbps)            5.81
Latency (1024, 512) (clock cycle)        266
Throughput (1024, 512) (Gbps)            2.01
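
As a quick arithmetic check of the latency figures, the sketch below compares the cycle counts in Table III with the 0.75n - 1 latency listed for the 2b-SC-Precomputation decoder in Table II. The function names are hypothetical; the inputs are the reported values for n = 1024.

```python
def baseline_latency(n):
    """Latency of the 2b-SC-Precomputation decoder in clock cycles (0.75n - 1)."""
    return 0.75 * n - 1

def latency_reduction(n, proposed_cycles):
    """Fraction of the baseline latency saved by the proposed design."""
    return 1.0 - proposed_cycles / baseline_latency(n)

reduction_870 = latency_reduction(1024, 156)   # (1024, 870) code
reduction_512 = latency_reduction(1024, 266)   # (1024, 512) code
```

For n = 1024 the baseline is 767 cycles, so the two reported latencies correspond to reductions of roughly 80% and 65%, both consistent with the stated lower bound of 60%.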


V. Conclusion 

In this paper, we proposed a hardware architecture of the fast-SSC
algorithm for polar codes. By exploiting the similarity
between the decoding processes of fast constituent and regular
polar codes, the proposed design overcomes the disadvantage of
the fast-SSC decoder, namely its lack of decoding flexibility with
respect to multiple code rates. The corresponding scheduling plan
and the purpose-built PU are also described. Results show that the
proposed design significantly increases the decoding throughput
of polar codes compared with other state-of-the-art SC decoders.